Efficient Dynamic Dictionary Matching with DAWGs and AC-automata
Abstract
Dictionary matching is the task of finding all occurrences of the pattern strings in a set D (called a dictionary) in a text string T. The Aho-Corasick automaton (AC-automaton) built on D is a fundamental data structure that solves the dictionary matching problem in O(d log σ) preprocessing time and O(n log σ + occ) matching time, where d is the total length of the patterns in D, n is the length of the text, σ is the alphabet size, and occ is the total number of occurrences of all the patterns in the text. Dynamic dictionary matching is the variant in which patterns may be inserted into and deleted from D dynamically; the problem is called semi-dynamic dictionary matching if only insertions are allowed. In this paper, we propose two efficient algorithms, each of which solves both problems with some modifications. For a pattern of length m, our first algorithm supports insertions in O(m log σ + log d / log log d) time and pattern matching in O(n log σ + occ) time in the semi-dynamic setting. With some modifications, it also supports both insertions and deletions in O(σm + log d / log log d) time and pattern matching in O(n(log d / log log d + log σ) + occ(log d / log log d)) time for the dynamic dictionary matching problem. This algorithm is based on the directed acyclic word graph (DAWG) of Blumer et al. (JACM 1987). Our second algorithm, which is based on the AC-automaton, supports insertions in O(m log σ + u_f + u_o) time in the semi-dynamic setting and both insertions and deletions in O(σm + u_f + u_o) time in the dynamic setting, where u_f and u_o denote the numbers of states whose failure function and output function, respectively, need to be updated. This algorithm performs pattern matching in O(n log σ + occ) time in both settings. Its update time is optimal for AC-automaton-based methods, since any algorithm that explicitly maintains the AC-automaton requires Ω(u_f + u_o) update time.
Keywords: dynamic dictionary matching, AC-automaton, DAWG
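For orientation, the following is a minimal sketch of the static, textbook AC-automaton that the abstract takes as its starting point. It is not the paper's update algorithm. The names `build_ac` and `find_all` are hypothetical, patterns are assumed to be non-empty, and transitions are stored in hash maps rather than in the comparison-based structures behind the log σ factors. Output lists are also merged along failure links for brevity, which is simpler than a space-efficient output function.

```python
from collections import deque


def build_ac(patterns):
    """Build goto, failure, and output tables for non-empty patterns."""
    goto, fail, out = [{}], [0], [[]]  # state 0 is the root

    # Phase 1: insert every pattern into a trie (the goto function).
    for idx, pattern in enumerate(patterns):
        s = 0
        for c in pattern:
            if c not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].append(idx)

    # Phase 2: breadth-first traversal sets the failure function.
    # Depth-1 states keep fail = 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for c, t in goto[s].items():
            f = fail[s]
            while f and c not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(c, 0)
            # Inherit the outputs of the longest proper suffix state.
            out[t] = out[t] + out[fail[t]]
            queue.append(t)
    return goto, fail, out


def find_all(text, patterns, automaton):
    """Return (start_index, pattern) for every occurrence in text."""
    goto, fail, out = automaton
    s, hits = 0, []
    for i, c in enumerate(text):
        while s and c not in goto[s]:
            s = fail[s]
        s = goto[s].get(c, 0)
        for idx in out[s]:
            hits.append((i - len(patterns[idx]) + 1, patterns[idx]))
    return hits


patterns = ["he", "she", "his", "hers"]
automaton = build_ac(patterns)
hits = find_all("ushers", patterns, automaton)
```

In this sketch, adding a pattern means calling `build_ac` again on the enlarged dictionary, which costs time proportional to d. The abstract's bounds describe avoiding that rebuild by updating only the states whose failure or output values change.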
Similar resources
Efficient Dynamic Dictionary Matching with DAWGs and AC-automata
The dictionary matching is a task to find all occurrences of pattern strings in a set D (called a dictionary) on a text string T . The Aho-Corasick-automaton (AC-automaton) which is built on D is a fundamental data structure which enables us to solve the dictionary matching problem in O(d log σ) preprocessing time and O(n log σ + occ) matching time, where d is the total length of the patterns i...
Approximate String Matching by Finite Automata
Abstract. Approximate string matching is a sequential problem and therefore it is possible to solve it using finite automata. A nondeterministic finite automaton is constructed for string matching with k mismatches. It is shown how "dynamic programming" and "shift-and" based algorithms simulate this nondeterministic finite automaton. The corresponding deterministic finite automaton have O...
Ternary Directed Acyclic Word Graphs
Given a set S of strings, a DFA accepting S offers a very time-efficient solution to the pattern matching problem over S. The key is how to implement such a DFA in the trade-off between time and space, and especially the choice of how to implement the transitions of each state is critical. Bentley and Sedgewick proposed an effective tree structure called ternary trees. The idea of ternary trees...
Text Disambiguation By Finite State Automata, An Algorithm And Experiments On Corpora
The exploration of the context should provide clues that eliminate the non-relevant solutions. For this purpose we use local grammar constraints represented by finite automata. We have designed and implemented an algorithm which performs this task by using a large variety of linguistic constraints. Both the texts and the rules (or constraints) are represented in the same formalism, that is fini...
Inexact Pattern Matching Algorithms via Automata
Pattern matching occurs in various applications, ranging from simple text searching in word processors to identification of common motifs in DNA sequences in computational biology. The problem of exact pattern matching has been well studied and a number of efficient algorithms exist. However, these exact pattern matching algorithms are of little help when they are applied to finding patterns in ...
Journal title:
- CoRR
Volume abs/1710.03395, Issue -
Pages -
Publication date 2017